28 research outputs found

    Learning languages from parallel corpora

    This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessments can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Second, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. Third, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
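    As a minimal sketch of the two mechanisms described above, the hypothetical Python snippet below generates a cloze exercise from a word-aligned sentence pair and propagates a quality rating to a new language pair via a pivot language (triangulation). All data structures, names and the scoring rule are illustrative assumptions, not the application blueprint itself.

```python
# Hypothetical sketch: cloze generation from a word-aligned sentence pair and
# score transfer via a pivot language. Not the authors' implementation.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AlignedPair:
    src_tokens: list[str]              # e.g. a Swedish sentence
    tgt_tokens: list[str]              # e.g. its German translation
    alignment: list[tuple[int, int]]   # (source index, target index) links


def make_cloze(pair: AlignedPair, src_index: int) -> tuple[str, str] | None:
    """Blank out the target word aligned to src_tokens[src_index]."""
    tgt_indices = [t for s, t in pair.alignment if s == src_index]
    if not tgt_indices:
        return None                    # unaligned word: no exercise possible
    gap = tgt_indices[0]
    prompt = " ".join("____" if i == gap else tok
                      for i, tok in enumerate(pair.tgt_tokens))
    return prompt, pair.tgt_tokens[gap]


def triangulate(score_a_pivot: float, score_pivot_b: float) -> float:
    """Transfer ratings A->pivot and pivot->B to the unseen pair A->B.
    Here the weaker of the two ratings is propagated (an assumption)."""
    return min(score_a_pivot, score_pivot_b)


pair = AlignedPair(
    src_tokens=["jag", "läser", "boken"],
    tgt_tokens=["ich", "lese", "das", "Buch"],
    alignment=[(0, 0), (1, 1), (2, 3)],
)
print(make_cloze(pair, 1))     # ('ich ____ das Buch', 'lese')
print(triangulate(0.9, 0.7))   # 0.7
```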

    Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora

    The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries, in the format of generalised treebank query languages, are automatically translated into SQL queries.
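    A minimal sketch of the storage and querying idea, assuming a strongly simplified relational schema: tokens carry their annotation as columns, word alignment is a link table, and a treebank-style query is expressed as the SQL join it could be translated into. Table and column names are invented; the actual database design is more elaborate.

```python
# Strongly simplified schema for annotated, word-aligned parallel corpora,
# using SQLite for the sake of a runnable example. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE token (
    token_id INTEGER PRIMARY KEY,
    lang     TEXT,      -- e.g. 'de', 'sv'
    sent_id  INTEGER,   -- sentence the token belongs to
    position INTEGER,   -- position within the sentence
    surface  TEXT,
    lemma    TEXT,
    pos      TEXT       -- part-of-speech tag
);
CREATE TABLE word_alignment (
    src_token_id INTEGER REFERENCES token(token_id),
    tgt_token_id INTEGER REFERENCES token(token_id)
);
""")

# A treebank-style query such as  [pos="VERB" lang="de"] aligned-to [lang="sv"]
# could be translated into a SQL join like this one:
QUERY = """
SELECT s.surface AS de_verb, t.surface AS sv_translation
FROM token s
JOIN word_alignment a ON a.src_token_id = s.token_id
JOIN token t          ON t.token_id     = a.tgt_token_id
WHERE s.lang = 'de' AND s.pos = 'VERB' AND t.lang = 'sv';
"""

con.execute("INSERT INTO token VALUES (1, 'de', 1, 2, 'liest', 'lesen', 'VERB')")
con.execute("INSERT INTO token VALUES (2, 'sv', 1, 1, 'läser', 'läsa', 'VERB')")
con.execute("INSERT INTO word_alignment VALUES (1, 2)")
print(con.execute(QUERY).fetchall())   # [('liest', 'läser')]
```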

    Exploring Properties of Intralingual and Interlingual Association Measures Visually

    We present an interactive interface for exploring the properties of intralingual and interlingual association measures. Used in conjunction, they can be employed for phraseme identification in word-aligned parallel corpora. The customizable component we built to visualize individual results can show part-of-speech tags, syntactic dependency relations and word alignments next to the tokens of two corresponding sentences.
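    One widely used association measure that such an interface could display is pointwise mutual information; the sketch below computes it for an intralingual pair (adjacent words in one corpus) and for an interlingual pair (word-aligned tokens across a parallel corpus). Which measures the interface actually implements is not specified here, so treat the choice of PMI and all counts as assumptions.

```python
# Pointwise mutual information as one possible association measure,
# applied intralingually and interlingually. Counts are invented.
import math
from collections import Counter


def pmi(pair_count: int, count_x: int, count_y: int, total: int) -> float:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    return math.log2((pair_count / total) /
                     ((count_x / total) * (count_y / total)))


# Intralingual: association of adjacent words within one corpus.
tokens = "strong coffee and strong tea and weak coffee".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(pmi(bigrams[("strong", "coffee")],
          unigrams["strong"], unigrams["coffee"], len(tokens)))

# Interlingual: the same formula applied to counts of word-aligned pairs
# in a parallel corpus (numbers below are placeholders).
print(pmi(pair_count=40, count_x=50, count_y=60, total=10_000))
```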

    Binomials in Swedish corpora – ‘Ordpar 1965’ revisited

    This paper describes a corpus study on Swedish binomials, a special type of multi-word expression. Binomials follow the pattern "X conjunction Y", where X and Y are words, typically of the same part of speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to see to what extent these binomials can still be found in modern corpora and therefore checked the list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today, even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.
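    A hedged sketch of the kind of check described: look up one binomial candidate from the Bendz list in two corpora and compare its relative frequency per million tokens. The token lists below merely stand in for real corpus access and are invented.

```python
# Toy check of one binomial candidate against two (mocked) corpora,
# comparing relative frequency per million tokens.
from collections import Counter


def freq_per_million(trigram: tuple, tokens: list) -> float:
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return counts[trigram] / len(tokens) * 1_000_000


# Placeholder token lists; in practice these would be the full corpora.
europarl = "herr talman mina damer och herrar jag vill tacka".split()
opensubtitles = "hej då vi ses i morgon".split()
candidate = ("damer", "och", "herrar")   # 'ladies and gentlemen'

for name, corpus in [("Europarl", europarl), ("OpenSubtitles", opensubtitles)]:
    print(name, round(freq_per_million(candidate, corpus)))
```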

    SwissBERT: The Multilingual Language Model for Switzerland

    We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert.
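    A short usage sketch, following the public model card for ZurichNLP/swissbert: the model is loaded with Hugging Face transformers and one of its language adapters is activated before encoding a sentence. The adapter codes and the set_default_language call are taken from that card; the exact API may differ across transformers versions, so treat this as illustrative rather than authoritative.

```python
# Illustrative use of SwissBERT via Hugging Face transformers; the language
# adapter codes and set_default_language() follow the public model card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ZurichNLP/swissbert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.set_default_language("de_CH")    # also available: fr_CH, it_CH, rm_CH

inputs = tokenizer("Die Schweiz hat vier Landessprachen.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into one sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)        # e.g. torch.Size([1, 768])
```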

    NLP Corpus Observatory – Looking for Constellations in Parallel Corpora to Improve Learners’ Collocational Skills

    The use of corpora in language learning, both in classroom and self-study situations, has proven useful. Investigations into technology use show a benefit for learners who are able to work with corpus data through easily accessible technology. However, relatively little work has been done on exploring the possibilities of parallel corpora for language learning applications. The work described in this paper explores how a parallel corpus, enhanced with several annotation layers generated by NLP techniques, can be used to extract collocations that are non-compositional and thus indispensable to learn. We identify constellations, i.e. combinations of intra- and interlingual relations, calculate association scores on each relation and, based thereon, a joint score for each constellation. In this way, we are able to find relevant collocations for different types of constellations. We evaluate our approach and discuss scenarios in which language learners can playfully explore collocations. Our explorative web tool is freely accessible, generates collocation dictionaries on the fly, and links them to example sentences to ensure context embedding.
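    To make the scoring idea concrete, the sketch below represents a constellation as a set of intralingual and interlingual association scores and merges them into a joint score. The geometric mean used here is an assumption for illustration; the paper's actual combination formula is not reproduced.

```python
# Illustrative constellation scoring: intra- and interlingual association
# scores are merged into a joint score (geometric mean is an assumption).
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass
class Constellation:
    source_pair: tuple            # e.g. a verb-object collocation
    target_pair: tuple            # its aligned counterpart
    intralingual_scores: list     # association score per monolingual relation
    interlingual_scores: list     # association score per alignment link

    def joint_score(self) -> float:
        return geometric_mean(self.intralingual_scores
                              + self.interlingual_scores)


c = Constellation(
    source_pair=("fatta", "beslut"),          # Swedish: 'take a decision'
    target_pair=("treffen", "Entscheidung"),  # German counterpart
    intralingual_scores=[7.2, 6.8],
    interlingual_scores=[5.1, 4.4],
)
print(round(c.joint_score(), 2))
```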

    Crossing the Border Twice: Reimporting Prepositions to Alleviate L1-Specific Transfer Errors

    We present a data-driven approach that exploits word alignment in a large parallel corpus with the objective of identifying those verb- and adjective-preposition combinations which are difficult for L2 language learners. This allows us, on the one hand, to provide language-specific ranked lists that help learners focus on particularly challenging combinations given their native language (L1). On the other hand, we provide extensive statistics on such combinations with the objective of facilitating automatic error correction for preposition use in learner texts. We evaluate these lists first manually and then automatically, by applying our statistics to an error-correction task.
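    The ranking intuition can be illustrated with a small sketch: for each source-language verb-preposition combination, count which target-language prepositions it aligns to and treat high variability (a low probability of the dominant translation) as a proxy for learner difficulty. The counts and the difficulty score below are invented for illustration and are not the paper's exact statistic.

```python
# Toy ranking of verb-preposition combinations by how variable their aligned
# target prepositions are; high variability serves as a difficulty proxy.
from collections import Counter, defaultdict

# (German verb + preposition, aligned English preposition) observations,
# as they could be extracted from a word-aligned corpus; data is invented.
observations = [
    ("warten auf", "for"), ("warten auf", "for"), ("warten auf", "on"),
    ("denken an", "of"), ("denken an", "about"), ("denken an", "of"),
    ("abhängen von", "on"), ("abhängen von", "on"), ("abhängen von", "on"),
]

by_combination = defaultdict(Counter)
for combination, en_prep in observations:
    by_combination[combination][en_prep] += 1


def difficulty(counts: Counter) -> float:
    """1 minus the probability of the dominant translation (0 = trivial)."""
    return 1 - counts.most_common(1)[0][1] / sum(counts.values())


for combination, counts in sorted(by_combination.items(),
                                  key=lambda item: difficulty(item[1]),
                                  reverse=True):
    print(f"{combination:15s} {difficulty(counts):.2f} {dict(counts)}")
```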

    Multi-word Adverbs – How well are they handled in Parsing and Machine Translation?

    Multi-word expressions are often considered problematic for parsing and other natural language processing tasks. In this paper we investigate a specific type of multi-word expression: binomial adverbs, which follow the pattern adverb + conjunction + adverb. We identify and evaluate binomial adverbs in English, German and Swedish, and compute their degree of idiomaticity with an ordering test and with a mutual information score. We show that these idiomaticity measures point us to a number of fixed multi-word expressions which are often mis-tagged and mis-parsed. Interestingly, a second evaluation shows that state-of-the-art machine translation handles them well – with some exceptions.
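    A sketch of the two idiomaticity indicators mentioned above for a binomial adverb "X and Y": an ordering test (how strongly the pair prefers one order over its reversal) and a mutual-information-style association score. The exact formulas and corpus counts used in the paper may differ; the numbers below are invented.

```python
# Ordering test and a mutual-information-style score for a binomial adverb;
# corpus counts below are invented placeholders.
import math


def ordering_ratio(count_xy: int, count_yx: int) -> float:
    """Share of occurrences in the preferred order; 1.0 = completely fixed."""
    return max(count_xy, count_yx) / (count_xy + count_yx)


def pmi(pair_count: int, count_x: int, count_y: int, total: int) -> float:
    return math.log2((pair_count / total) /
                     ((count_x / total) * (count_y / total)))


# Example: German "ab und zu" ('now and then').
print(ordering_ratio(count_xy=950, count_yx=3))   # ~0.997, i.e. fixed order
print(pmi(pair_count=953, count_x=12_000, count_y=30_000, total=5_000_000))
```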

    Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database

    We present an approach for searching and exploring translation variants of multi-word units in large multiparallel corpora based on a relational database management system. Our web-based application Multilingwis, which allows for multilingual lookups of phrases and words in English, French, German, Italian and Spanish, is of interest to anybody who wants to quickly compare expressions across several languages, such as language learners without linguistic knowledge. In this paper, we focus on the technical aspects of how to represent and efficiently retrieve all occurrences that match the user's query in one of the five languages, simultaneously with their translations into the other four. To identify such translations in our corpus of 220 million tokens in total, we use statistical sentence and word alignment. By using materialized views, composite indexes, and pre-planned search functions, our relational database management system handles large result sets with only moderate demands on the underlying hardware. As our systematic evaluation on 200 search terms per language shows, we achieve retrieval times below 1 second in 75% of cases for multi-word expressions.
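    The database-side techniques named above can be sketched as PostgreSQL-style DDL: a materialized view that precomputes the token-alignment join, a composite index matching the typical lookup pattern, and a pre-planned search query executed with bound parameters (e.g. via psycopg2). All table, view and column names are hypothetical and do not reproduce the actual Multilingwis schema.

```python
# PostgreSQL-style DDL and a pre-planned search query, kept as strings that a
# driver such as psycopg2 would execute; all names are hypothetical.
DDL = """
-- Precompute the expensive token/alignment join once (materialized view)
-- instead of repeating it for every user query.
CREATE MATERIALIZED VIEW aligned_token AS
SELECT s.lemma       AS src_lemma,
       s.lang        AS src_lang,
       t.lemma       AS tgt_lemma,
       t.lang        AS tgt_lang,
       s.sentence_id AS src_sentence_id
FROM token s
JOIN word_alignment a ON a.src_token_id = s.token_id
JOIN token t          ON t.token_id     = a.tgt_token_id;

-- Composite index matching the typical lookup pattern:
-- language of the search term first, then its lemma.
CREATE INDEX aligned_token_src_idx ON aligned_token (src_lang, src_lemma);
"""

# The search itself only needs the user's term and language as parameters.
SEARCH = """
SELECT tgt_lang, tgt_lemma, count(*) AS freq
FROM aligned_token
WHERE src_lang = %(lang)s AND src_lemma = %(lemma)s
GROUP BY tgt_lang, tgt_lemma
ORDER BY freq DESC;
"""

print(DDL)
print(SEARCH)   # run e.g. via psycopg2 with {"lang": "de", "lemma": "Haus"}
```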